JRC Names is a highly multilingual named entity resource for person and organisation names, referred to as 'entities'. It consists of large lists of names and their many spelling variants (up to hundreds for a single person), including variants across scripts (Latin, Greek, Arabic, Cyrillic, Japanese, Chinese, etc.). The JRC Names resources are updated daily. The JRC Names annotator is a UIMA wrapper for these resources. Because the underlying resource is multilingual, the annotator is suitable for inclusion in each of the language processing chains (LPCs) in the project.
The paragraph splitter is based on the regular expression “((^.*\S+.*$)+)”, which matches lines containing at least one non-whitespace character. Further details can be found in the source code of the com.tetracom.uima.text.ParagraphSplitter class.
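The following is a minimal sketch of how this expression could be applied in Java. It is not the code of com.tetracom.uima.text.ParagraphSplitter: the class name ParagraphSplitSketch, the method split and the use of the MULTILINE flag are assumptions made for illustration. Under that flag, each match is one line that contains at least one non-whitespace character, and blank or whitespace-only lines are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the actual ParagraphSplitter implementation.
final class ParagraphSplitSketch {

    // The expression quoted in the text. MULTILINE is an assumption: it lets
    // ^ and $ match at line boundaries rather than only at the text boundaries.
    private static final Pattern PARAGRAPH =
            Pattern.compile("((^.*\\S+.*$)+)", Pattern.MULTILINE);

    // Returns every non-blank line of the text as one paragraph.
    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        Matcher matcher = PARAGRAPH.matcher(text);
        while (matcher.find()) {
            paragraphs.add(matcher.group(1));
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "First para.\n\nSecond para.\n   \nThird.";
        // Prints [First para., Second para., Third.]
        System.out.println(split(text));
    }
}
```

In a UIMA annotator, the offsets returned by matcher.start() and matcher.end() would typically be used to create paragraph annotations instead of collecting strings.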
This tool is based on regular expressions. URLs and email addresses contain the “.” (dot) character, which confuses the subsequent components. Therefore, URLs and email addresses found in the text are annotated as named entities and skipped by the other annotators in the chain.
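The idea can be sketched as follows. The tool's actual expressions are not given here, so the two patterns below are deliberately simple assumptions, and the names UrlEmailSketch, Span and annotate are hypothetical. The sketch only records the character span and type of each match, which is the information a downstream annotator would need in order to skip that span.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; the patterns are assumptions, not the tool's own.
final class UrlEmailSketch {

    // One alternation with two named groups: a simple http(s) URL pattern
    // and a simple email address pattern.
    private static final Pattern URL_OR_EMAIL = Pattern.compile(
            "(?<url>https?://[^\\s<>\"]+)"
            + "|(?<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,})");

    // A character span with a type label, standing in for a UIMA annotation.
    static final class Span {
        final int begin;
        final int end;
        final String type;

        Span(int begin, int end, String type) {
            this.begin = begin;
            this.end = end;
            this.type = type;
        }
    }

    // Finds all URL and email spans in the text, in order of appearance.
    static List<Span> annotate(String text) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = URL_OR_EMAIL.matcher(text);
        while (matcher.find()) {
            String type = matcher.group("url") != null ? "URL" : "EMAIL";
            spans.add(new Span(matcher.start(), matcher.end(), type));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Mail info@example.org or see http://example.org/a.html for details.";
        for (Span span : annotate(text)) {
            System.out.println(span.type + ": " + text.substring(span.begin, span.end));
        }
    }
}
```

Note that this simple URL pattern would also absorb punctuation that directly follows a URL, such as a sentence-final dot. A production pattern would need to handle that case.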
The BG tokenizer, BG NP extractor and BG NE recogniser use ParseEst, a generic tool for crafting, compiling and applying linguistic rules. The tool can also be used for other languages and for other tasks that involve developing language grammars. The rules are formulated in the XML-based ParseEst formalism. ParseEst consists of two main modules: lr_builder and lr_engine.
The last component in each language processing chain is the “Performance report” provider. The annotated text is enriched with a performance report containing the overall processing time (in milliseconds) for the whole document, as well as the processing time for each primitive engine (annotator in the chain). Each performance report is then stored in a database so that the LPC can be evaluated in terms of productivity.
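The timing scheme described above might look roughly like the sketch below. It does not use the UIMA API: each engine is modelled as a plain function on the document text, and the names PerformanceReportSketch, Report and run are hypothetical. Storing the report in a database is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration of per-engine and overall timing for one document.
final class PerformanceReportSketch {

    // Processing times in milliseconds, per engine and for the whole chain.
    static final class Report {
        final Map<String, Long> engineMillis = new LinkedHashMap<>();
        long totalMillis;
    }

    // Runs the engines in insertion order and measures each one separately.
    static Report run(Map<String, UnaryOperator<String>> chain, String document) {
        Report report = new Report();
        long chainStart = System.nanoTime();
        String current = document;
        for (Map.Entry<String, UnaryOperator<String>> engine : chain.entrySet()) {
            long engineStart = System.nanoTime();
            current = engine.getValue().apply(current);
            long elapsedNanos = System.nanoTime() - engineStart;
            report.engineMillis.put(engine.getKey(), elapsedNanos / 1_000_000);
        }
        report.totalMillis = (System.nanoTime() - chainStart) / 1_000_000;
        return report;
    }

    public static void main(String[] args) {
        // Stand-in engines; a real chain would contain UIMA annotators.
        Map<String, UnaryOperator<String>> chain = new LinkedHashMap<>();
        chain.put("trimmer", text -> text.trim());
        chain.put("lowercaser", text -> text.toLowerCase());

        Report report = run(chain, "  Some Document Text  ");
        System.out.println("per engine (ms): " + report.engineMillis);
        System.out.println("total (ms): " + report.totalMillis);
    }
}
```

A LinkedHashMap is used so that the report lists the engines in the same order in which they run in the chain.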
ATLAS (Applied Technology for Language-Aided CMS) is a project funded by the European Commission under the CIP ICT Policy Support Programme.